
Avro metrics support: create MetricsAwareDatumWriter and some refactors for Avro #1946

Merged
merged 2 commits into apache:master on Feb 3, 2021

Conversation

yyanyy
Contributor

@yyanyy yyanyy commented Dec 16, 2020

This change is a smaller PR broken down from #1935. There is no change in behavior from this PR. It covers the following:

  • add a comparator for byte arrays
  • update the field metrics bound type so that object-to-ByteBuffer translation can happen later
  • add a metrics() method to ValueWriter for Avro, currently defaulting to an empty stream
  • create MetricsAwareDatumWriter, which exposes writer metrics, and replace DatumWriter with it in various classes
  • add metrics config to the Avro writer and builder
  • create an AvroMetrics class that preserves the current behavior for producing metrics for the Avro writer
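The shape of the change can be sketched in miniature. This is a hypothetical, standalone illustration of the pattern (the names `MetricsAwareWriter` and `CountingWriter` and the simplified `FieldMetrics` are assumptions for this sketch, not Iceberg's actual classes): writers expose a stream of per-field metrics, and writers that do not track anything inherit an empty default.

```java
import java.util.stream.Stream;

// Simplified stand-in for Iceberg's FieldMetrics: per-field stats collected
// while writing. The real class carries counts and bounds per field id.
class FieldMetrics {
  private final int fieldId;
  private final long valueCount;

  FieldMetrics(int fieldId, long valueCount) {
    this.fieldId = fieldId;
    this.valueCount = valueCount;
  }

  int fieldId() { return fieldId; }
  long valueCount() { return valueCount; }
}

// Hypothetical metrics-aware writer interface: write() does the work,
// metrics() exposes whatever stats the writer tracked, empty by default.
interface MetricsAwareWriter<T> {
  void write(T datum);

  default Stream<FieldMetrics> metrics() {
    return Stream.empty();
  }
}

// A leaf writer that overrides the default and reports a value count.
class CountingWriter implements MetricsAwareWriter<String> {
  private long count = 0;

  @Override
  public void write(String datum) {
    count++;
  }

  @Override
  public Stream<FieldMetrics> metrics() {
    return Stream.of(new FieldMetrics(1, count));
  }
}
```

A parent writer can then merge its children's streams with `Stream.concat`, which is how the composite writers in this PR aggregate metrics upward.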

@rdblue
Contributor

rdblue commented Dec 17, 2020

Thanks @yyanyy, I'll take a look at this one soon!

private final ByteBuffer lowerBound;
private final ByteBuffer upperBound;
private final Object lowerBound;
private final Object upperBound;
Contributor

Why change this to Object rather than ByteBuffer? It seems like conversion to ByteBuffer would be cleaner if it were done in each writer, since the writer already knows its type (it is going to call the right method on the encoder).

Contributor Author

I think if we convert this to ByteBuffer now, we may still need to check the type when doing truncation (based on metrics mode), and I think a string with non-ASCII characters will not produce the same result if truncated by BinaryUtil.truncateBinary. So we would either have to convert the byte buffer back to a char sequence and use UnicodeUtil.truncateString, or create a new BinaryUtil.truncateString. Whereas if we do the conversion later, when evaluating metrics, the code needed for the conversion itself isn't that bad since we know the type of the field, and that's the reason for this change.

But one thing that may be worth noting is that with the current approach, in order to ensure that Conversions.toByteBuffer works, for certain writers I have to make sure the min/max returned by the value writers are of a type that Conversions.toByteBuffer knows how to translate whenever the data type used in write is not of that type (that is, this usage of the method). I think we would still need to maintain a similar translation function in each value writer if we return ByteBuffer for bounds in field metrics, but it would translate the input data type directly to a byte buffer instead of doing two hops, and that might be easier to understand.
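The truncation concern above can be demonstrated in isolation. This is a hypothetical sketch (the helper `binaryTruncate` is invented for illustration; it is not BinaryUtil.truncateBinary): cutting a UTF-8 byte sequence at an arbitrary byte offset can split a multi-byte character, which character-level truncation would never do.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Demonstrates why byte-level and char-level truncation disagree for
// non-ASCII strings. 'é' encodes to two bytes (0xC3 0xA9) in UTF-8, so
// truncating "aé" to 2 bytes cuts the character in half; decoding the
// result replaces the dangling lead byte with U+FFFD.
class TruncationDemo {
  static String binaryTruncate(String s, int maxBytes) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    byte[] cut = Arrays.copyOf(utf8, Math.min(maxBytes, utf8.length));
    // String's decoder substitutes U+FFFD for malformed input.
    return new String(cut, StandardCharsets.UTF_8);
  }
}
```

Character-aware truncation (`"aé".substring(0, 1)`) yields a valid string `"a"`, while the byte-level cut yields a corrupted one, which is the reason given above for deferring the Object-to-ByteBuffer conversion until after truncation.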

Contributor

I guess I was thinking that truncation would happen when FieldMetrics is constructed, in the leaf writers. If that's not the case, then I think it makes sense to do the conversion later.

If the conversion happens later, then I think this class should be parameterized. I never like to have classes that track just Object. We should at least guarantee that both lower and upper bounds are the same type, for example.
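The parameterization suggested here could look like the following sketch (the class name `Bounds` and its accessors are illustrative assumptions, not the final Iceberg code): typing the class over `T` guarantees at compile time that the lower and upper bounds are the same type, instead of tracking raw `Object`.

```java
// Hypothetical parameterized bounds holder: both bounds share one type
// parameter, so a caller cannot mix, say, an Integer lower bound with a
// CharSequence upper bound the way an Object-typed class would allow.
class Bounds<T> {
  private final T lowerBound;
  private final T upperBound;

  Bounds(T lowerBound, T upperBound) {
    this.lowerBound = lowerBound;
    this.upperBound = upperBound;
  }

  T lowerBound() { return lowerBound; }
  T upperBound() { return upperBound; }
}
```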

Contributor Author

I was mostly following the pattern of ORC and Parquet, which evaluate the metrics mode when collecting metrics (which has to be the case there, since those file formats collect stats themselves). But I think there's nothing preventing us from taking the metrics mode into account during value writer creation; it will just make the visitor pattern a little more complicated. I'll give it a try, and thanks for bringing up this idea!

I guess for now I'll revert the change to FieldMetrics in this PR and include it in the next one that updates the value writers, if we still need to change it. Hopefully that doesn't add too much to the next PR!

@@ -157,6 +157,10 @@ public int compare(List<T> o1, List<T> o2) {
return UnsignedByteBufComparator.INSTANCE;
}

public static Comparator<byte[]> unsignedByteArray() {
Contributor

Other method names are plural. Could we use unsignedByteArrays()?
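An unsigned, lexicographic byte[] comparator of the kind this diff adds can be sketched as follows (illustrative only; the real implementation lives in Iceberg's Comparators class and may differ in detail). The key points are masking each byte with `0xFF` so it compares as an unsigned value in 0..255, and breaking ties on length, so the shorter array sorts first.

```java
import java.util.Comparator;

// Sketch of an unsigned lexicographic comparator for byte arrays.
class UnsignedByteArrays {
  static final Comparator<byte[]> UNSIGNED = (a, b) -> {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      // Java bytes are signed; masking with 0xFF compares them as
      // unsigned values, so (byte) 0xFF sorts after 0x01.
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    // If there are no differences, the shorter array is smaller.
    return Integer.compare(a.length, b.length);
  };
}
```

Without the mask, `(byte) 0xFF` would compare as -1 and incorrectly sort before `0x01`, which is exactly the bug an "unsigned" comparator exists to avoid.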

@@ -123,7 +127,7 @@ public WriteBuilder named(String newName) {
return this;
}

public WriteBuilder createWriterFunc(Function<Schema, DatumWriter<?>> writerFunction) {
public WriteBuilder createWriterFunc(Function<Schema, MetricsAwareDatumWriter<?>> writerFunction) {
Contributor

This is going to break existing uses of createWriterFunc in projects that build on Iceberg. I think this should keep the old parameter and just check whether the implementation is MetricsAwareDatumWriter in the appender to return metrics.
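The backward-compatible approach suggested here can be sketched as follows (these interfaces and the helper `metricsOf` are stand-ins invented for illustration, not Iceberg's actual API): keep accepting the old writer type, and only read metrics when the implementation happens to be metrics-aware.

```java
import java.util.stream.Stream;

// Stand-in for the old writer interface that external projects implement.
interface Writer<T> {
  void write(T datum);
}

// Stand-in for the new metrics-exposing interface; String stands in for
// FieldMetrics to keep the sketch self-contained.
interface MetricsAware {
  Stream<String> metrics();
}

// A writer built against the new interface contributes metrics.
class NewStyleWriter implements Writer<String>, MetricsAware {
  @Override
  public void write(String datum) { }

  @Override
  public Stream<String> metrics() {
    return Stream.of("field-1-metrics");
  }
}

class MetricsHelper {
  // The appender checks at runtime whether the writer is metrics-aware.
  // Legacy implementations that predate the interface still work; they
  // simply contribute no metrics.
  static Stream<String> metricsOf(Writer<?> writer) {
    if (writer instanceof MetricsAware) {
      return ((MetricsAware) writer).metrics();
    }
    return Stream.empty();
  }
}
```

This keeps the `createWriterFunc` signature unchanged for downstream projects while letting Iceberg's own writers opt in to metrics.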

Contributor Author

Sounds good, I didn't think of the case where people have their own implementations of these interfaces, so I totally missed this. Will update and keep it in mind!

@github-actions github-actions bot added the data label Dec 18, 2020
@rdblue
Contributor

rdblue commented Dec 28, 2020

@yyanyy, can you rebase this to fix the conflict?

@yyanyy yyanyy requested a review from rdblue January 5, 2021 22:41
Contributor

@jackye1995 jackye1995 left a comment

Thanks for rebasing, it looks good to me

}
}

// if there are no differences, then the shorter seq is first
Contributor

nit: "the shorter seq is first" is a bit confusing to me, maybe "is smaller" is a better word.


@Override
public Stream<FieldMetrics> metrics() {
return Stream.concat(PATH_WRITER.metrics(), POS_WRITER.metrics());
Contributor

This should also include metrics from the rowWriter, right?

Contributor

This can be fixed in a follow-up.
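The follow-up fix discussed above amounts to nesting a second `Stream.concat`, since `Stream.concat` takes exactly two arguments. A minimal sketch (the method and parameter names are illustrative, not the actual field names in the delete-file writer):

```java
import java.util.stream.Stream;

// Merging metrics from all three child writers, not just path and position.
class MetricsConcat {
  static Stream<String> metrics(Stream<String> pathMetrics,
                                Stream<String> posMetrics,
                                Stream<String> rowMetrics) {
    // Stream.concat is binary, so three streams require nesting.
    return Stream.concat(pathMetrics, Stream.concat(posMetrics, rowMetrics));
  }
}
```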

CodecFactory codec, Map<String, String> metadata) throws IOException {
DataFileWriter<D> writer = new DataFileWriter<>(
(DatumWriter<D>) createWriterFunc.apply(schema));
(DatumWriter<D>) metricsAwareDatumWriter);
Contributor

Minor: I don't think this needs to be a MetricsAwareDatumWriter, right? It isn't in the type signature, so we should name it just datumWriter.

@rdblue rdblue merged commit 6cc5d99 into apache:master Feb 3, 2021
@rdblue
Contributor

rdblue commented Feb 3, 2021

Thanks, @yyanyy! This looks good now so I merged it. That should unblock the next steps.
